ABSTRACT



A method for prefetching the data and parity blocks for 
generating parity data of a stripe. The method uses a 
low and a high threshold marker indicative of a first and
second level of fullness of the cache to determine 
whether or not to prefetch the data and parity blocks. If 
the cache is filled to a level exceeding the first level of
fullness, the data and parity blocks are prefetched for 
any blocks to be written to the disk drive between the 
low and high threshold. The data and parity blocks are 
read from the disk drive at a lower processing priority 
in anticipation of the writing of the block. 

10 Claims, 4 Drawing Sheets

DATA AND PARITY PREFETCHING FOR REDUNDANT ARRAYS OF DISK DRIVES

FIELD OF THE INVENTION

This invention relates to the control of multiple inexpensive disk
drives for use with a computer system, and more particularly to a
system for generating parity data for multiple disk drives.

BACKGROUND TO THE INVENTION

It is a problem in the field of computer systems to provide an
inexpensive, high performance, high reliability, and high capacity
disk storage device. Traditional high performance and high capacity
disk devices have typically used single large expensive disks (SLED)
having form factors in the range of 12 or 14 inches.

The rapid acceptance of personal computers has created a market for
inexpensive small form factor drives, such as 5¼, 3½ inch, or
smaller. Consequently, a disk storage device comprising a redundant
array of inexpensive disks (RAID) has become a viable alternative for
storing large amounts of data.

RAID products substitute many small disks for a few large expensive
disks to provide higher storage capacities. The drawback of replacing
a single large disk with, for example, a hundred small disks, is
reliability. In a disk storage device consisting of many disk drives,
there is a much higher probability that one of the drives will fail,
making the device inoperable. However, by means of data redundancy
techniques, the reliability of RAID products can be substantially
improved.

RAID products typically use parity encoding to survive and recover
from disk drive failures. Different levels of RAID organizations
using parity encoding are currently known, see "A case for redundant
arrays of inexpensive disks," David A. Patterson et al., Report No.
UCB/CSD 87/391, December 1987, Computer Science Division (EECS),
Berkeley, Calif. 94720. Parity data are generated by XORing data to
be written with previously stored data and previously stored parity
data. RAID parity protection suffers from the inherent problem that
the number of I/O requests, reads and writes, that must be serviced
to write data is much greater than would be the case with non-RAID
disks.

Striping and caching are well-known techniques to improve the I/O
throughput in RAID products using parity protection. Striping
involves the concurrent transfer of a "stripe" of data to and from
disk drives. With striping, an I/O request to transfer a stripe of
data is distributed over a group of disk drives, that is, each of the
disk drives transfers, generally concurrently, a block of the data.
For example, if there are 5 disk drives in the array, and if a stripe
is defined to include 5 blocks, the entire stripe can be written to
the disk drives in about 1/5th the amount of time if one block of the
stripe is written to each of the disk drives concurrently.

Striping is typically used in combination with a memory buffer cache
or "cache" to take advantage of the principles of locality of
reference, which are well known in computer programming. These
principles indicate that when data stored at one location are
accessed, there is a high probability that data stored at physically
adjacent locations will be accessed soon afterwards in time. By
having a cache, the number of physical I/O transfers is reduced since
there is a high probability that the requested data are already
stored in the cache.

However, in RAID products the process of writing data and parity data
to the disk drives is still exceptionally tedious. In a one
dimensional parity protection scheme, which is the simplest form of
parity protection, one parity block is generated for each stripe. The
parity block for a particular stripe is generated by XORing the data
stored in a block with the data stored in the other blocks of the
stripe. Typically, the parity block of a stripe is stored on a disk
drive other than those which store the data blocks of the stripe.
Thus, should one of the disk drives fail, the data can be recovered,
stripe by stripe, from the parity block and the blocks stored on the
surviving disk drives. Similarly, any parity block stored on a failed
drive can easily be regenerated from the blocks on the surviving
drives.

The process of writing a new data block generally involves the
following steps: a) reading the old data block which is to be
replaced by the new data block; b) reading the old parity block; c)
generating new parity from the old data, the old parity data, and the
new data; d) writing the new data block; and e) writing the new
parity block.

In other words, at the time that new data are written to the disk
drive, traditional processes require four I/O requests to write a
single block of data, retarding actual host I/O throughput.

Therefore, it is desirable to provide a system for RAID which reduces
the impact of having to process additional I/O requests to generate
parity data at the time that new data is written to a disk drive.

SUMMARY OF THE INVENTION

The present invention provides a system which improves the I/O
performance of a computer system including a central processor unit
or "host", an array of disk drives and a memory buffer cache. The
disk drives store data in stripes, each stripe includes a plurality
of data blocks and a parity block for storing parity data generated
from the data blocks. The cache stores data read from, and data to be
written to the disk drives. The cache is organized into blocks
compatible with the block structure of the disk drives. The blocks in
cache are managed in a least recently used (LRU) manner. A low and a
high threshold signal are provided to indicate the relative fullness
of the cache.

The system for prefetching data blocks and parity blocks uses the low
and high threshold signals to determine whether or not to prefetch
the data and the parity blocks. When the low threshold signal is
detected, a prefetch procedure is started to read the blocks
necessary to generate parity data [...] written to the disk drive,
the block to be overwritten and the corresponding parity block of the
stripe are read into cache. Each block which has been prefetched is
marked as such. Prefetching continues until the high threshold signal
is detected. The high threshold signal causes the blocks in cache to
be written to the disk drives in LRU order, until sufficient blocks
have been written to disk to disable the low threshold signal.
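
The interplay of the two threshold signals summarized above can be illustrated with a minimal sketch. It is not the patented implementation: the class `WriteCache`, its attribute names, and the block-count thresholds are hypothetical, disk I/O is only logged rather than performed, and the prefetch runs synchronously instead of at a lower processing priority.

```python
from collections import OrderedDict


class WriteCache:
    """Toy write cache with a low and a high fullness threshold.

    Hypothetical illustration only: disk reads and writes are recorded
    in lists instead of being issued to real drives.
    """

    def __init__(self, low_blocks, high_blocks):
        self.low_blocks = low_blocks      # first level of fullness (low threshold)
        self.high_blocks = high_blocks    # second level of fullness (high threshold)
        self.blocks = OrderedDict()       # block id -> new data, LRU entry first
        self.prefetched = set()           # ids whose old data and parity are cached
        self.prefetch_reads = []          # reads issued ahead of time
        self.demand_reads = []            # reads forced at write-back time
        self.flushed = []                 # ids written back, in write-back order

    def write(self, block_id, data):
        """Store new data in the cache and react to the threshold signals."""
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)          # most recently used goes last
        if len(self.blocks) >= self.high_blocks:
            self.flush()
        elif len(self.blocks) >= self.low_blocks:
            self.prefetch()

    def prefetch(self):
        """Walk the entries in LRU order and prefetch whatever is missing."""
        for block_id in self.blocks:
            if block_id not in self.prefetched:
                self.prefetch_reads.append(block_id)   # old data + old parity
                self.prefetched.add(block_id)          # mark as prefetched

    def flush(self):
        """Write back LRU blocks until the low threshold is no longer met."""
        while len(self.blocks) >= self.low_blocks:
            block_id, _ = self.blocks.popitem(last=False)   # oldest entry
            if block_id not in self.prefetched:
                self.demand_reads.append(block_id)     # read-modify-write penalty
            self.prefetched.discard(block_id)
            self.flushed.append(block_id)


cache = WriteCache(low_blocks=8, high_blocks=9)
for i in range(9):
    cache.write(i, b"new data")
print(cache.prefetch_reads, cache.flushed, cache.demand_reads)
```

In this run the eighth write triggers the prefetch and the ninth triggers the write-back, so the blocks written back need no reads at write time. Expressing the thresholds as block counts rather than percentages is a design choice of the sketch that avoids floating-point comparisons.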

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computer system and RAID configuration
encompassing the present invention;

FIG. 2 is a block diagram of the RAID configuration of FIG. 1;

FIG. 3 is a block diagram of a LRU list according to a preferred
embodiment of the invention; and

FIG. 4 is a flow chart of a method to prefetch data and parity blocks
in anticipation of writing data to the RAID configuration of FIG. 2.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring now to the drawings, FIG. 1 shows a computer system
generally indicated by reference numeral 1. The computer system 1
includes a central processor unit or "host" 10 having primary
temporary data storage, such as memory 11, and secondary permanent
data storage, such as a disk device 20. The host 10 and the disk
device 20 are connected by a communication bus 30. The computer
system 1 also includes a memory buffer cache (cache) 40 also
connected to the system bus 30.

The host 10 is generally conventional and is of the type that
supports a multiple number of concurrent users executing a wide
variety of computer applications, including database applications
which use the disk device 20 for storing data. During operation of
the computer system, the host 10 issues I/O requests, such as reads
and writes, to transfer data between memory 11 and the disk device 20
via the bus 30. For purposes of illustration only, and not to limit
generality, this invention will be described with reference to its
use in the disk device 20 which is organized as a RAID device as
described in the Patterson et al. paper. However, one skilled in the
art will recognize that the method of the invention may also be used
in storage devices organized in different manners.

FIG. 2 shows, in schematic block diagram form, a disk device 20
organized in a RAID fashion as described in the Patterson et al.
paper. The disk device 20 comprises a controller 29 connected to the
system bus 30, and a plurality of, for example five, disk drives
21-25.

The storage space of the disk drives 21-25 is physically organized
into, for example, sectors, tracks, and cylinders. However, in order
to simplify access by the host 10, the storage space of the disk
drives 21-25 is also organized into a set of sequentially numbered
logical blocks, generally indicated by reference numeral 41. By using
logical blocks 41, the details of the physical organization of the
disk drives 21-25, for example, the number of sectors per track, the
number of tracks per cylinder, and the physical distribution of all
data across the drives 21-25, do not need to be known by the users of
the host 10. In the preferred embodiment, a block of data is equal to
the amount of data that can be conveniently transferred between the
host 10 and the disk drives 21-25 with a single I/O request, for
example, a single or multiple number of sectors.

To improve the I/O throughput of the disk device 20, the data are
further organized into yet larger sections of data, known as
"stripes," generally indicated by reference numeral 61. Striping
techniques are well known in RAID devices, and generally involve the
generally concurrent reading and writing of data to several disk
drives. With striping, a host distributes the data transfer across
the disk drives 21-25. In the preferred embodiment, the stripe 61 is
equal to the amount of data that is transferred when one block 41 is
transferred for each of the disk drives 21-25, for example five
blocks 41.

Now, RAID type devices which include a large number of disk drives
have a fairly high probability that one of the disk drives in the
array will fail. Therefore, parity encoding is typically used to
recover data that may be lost due to a disk drive failure. For
example, one of the blocks 41 of a stripe 61, the "parity block,"
stores parity data. The parity data in the parity block is generated,
for example, by taking the exclusive OR (XOR) of the blocks 41. To
ensure data recovery, the parity block is usually stored on a disk
drive different than one which stores a data block of the stripe from
which the parity data is generated. In RAID level 5, as described in
Patterson et al., the parity blocks are interleaved among all of the
disk drives 21-25 to lessen contention for any one of the drives.

To take further advantage of the principles of locality of reference,
the computer system 1 is provided with the memory buffer cache
(cache) 40. Presumably, the host 10 can access data stored in a
semiconductor cache considerably faster than data stored on the disk
drives 21-25. Data frequently used by the host 10 are retained in
cache 40 for as long as possible to decrease the number of physical
I/O requests to transfer data between the host 10 and the disk drives
21-25, and also to allow data aggregation prior to writing data to
the disk drives 21-25.

Accordingly, and again with reference to FIG. 1, the computer system
1 further includes the cache 40. In the preferred embodiment, the
cache 40 comprises, for example, 4 megabytes (MB) of random access
memory. Host 10 I/O read requests transfer "old" data from the disk
drives 21-25 to the cache 40, and from the cache 40 to the memory 11.
Host 10 I/O write requests store modified or "new" data in the cache
40, and physical I/O write requests transfer the new data from the
cache 40 to the disk drives 21-25, generally some time thereafter.
While the new data are stored in the cache 40, that is, before the
new data are written to permanent storage on the disk drives 21-25,
the new data are vulnerable to corruption due to, for example, power
or system failures. For this reason, the cache 40 is relatively
expensive non-volatile memory. Memory space in the cache 40 for
storing new data is allocated to the users in quantities equal to the
size of a block 41.

For some applications, for example database applications, where the
amount of data read is much larger than the amount of data that is
written, it may be advantageous to partition the cache 40 into a
larger read cache and a smaller write cache. That portion of the
cache 40 which is used for storing data read from the disk drives
21-25 can be of less expensive volatile memory, since the data, in
case of a failure, can easily be restored from the disk drives 21-25.

In any case, the limited amount of memory of the cache 40 is
typically managed in a least recently used (LRU) manner. LRU
algorithms are well known in computer programming, and can be
implemented in any number of ways. In general, an LRU algorithm
deallocates memory space in aged order. That is, memory space storing
data which were least recently used (LRU) is deallocated and made
available for other uses before memory space storing data which were
most
recently used (MRU), used meaning any access, read or write to the
data. In the case of deallocation of new data, that is, data not yet
stored on the disk drives 21-25, deallocation involves the step of
writing the data to disk.

In general, an LRU algorithm is implemented by means of a LRU list
70, as shown in FIG. 3. The LRU list 70 is nothing more than an
ordered set of entries 71, each entry 71 referencing an allocatable
amount of cache memory space, for example, blocks 41. The entries 71
are logically sequenced in the LRU list 70 in aged order by means of
entry links 72 and 73. The entry links 72 each reference the next LRU
entry 71, and the entry links 73 reference the previous MRU entry 71.

In addition, in order to monitor the amount of cache 40 that is
allocated to active use, a low threshold signal (LT) 74 and a high
threshold signal (HT) 75 are provided as follows. When the cache 40
reaches a predetermined first level of fullness, for example 80%
full, the LT 74 is generated. When the cache 40 reaches a
predetermined second level of fullness, for example 90% full, the HT
75 is generated.

Appropriate corrective procedures are invoked when the cache 40
reaches the first and second level of fullness. One procedure that is
started when the cache is filled to the first level of fullness is an
opportunistic data and parity prefetch procedure which reads data and
parity blocks required to generate parity data for blocks 41 about to
be written. When the cache is filled to the second level of fullness,
sufficient blocks 41 are deallocated in the LRU order to disable the
low threshold signal.

Deallocation of blocks 41 to be written to the disk drives 21-25
requires the generation of parity data. As previously stated,
generating parity information generally involves the steps of: a)
reading the old data block from the disk drive, that is, the data
block that will be replaced by the new data block stored in cache; b)
reading the old parity block from the disk drive; c) generating the
new parity block by XORing the old data block with the old parity
block and the new data block; d) writing the new data block; and e)
writing the new parity block.

The prefetch procedure, according to one embodiment of the invention,
reads the old data and parity blocks for any blocks 41 in the cache
40 which store new data to be written to the disk drives, before the
new data block is aged as the LRU block, thereby minimizing the
effect of steps a) and b) above. The prefetch procedure uses the
"fullness" of the write cache 40 to determine when to prefetch old
data blocks and old parity blocks.

As new data blocks 41 are added to the cache 40, the LT 74 and HT 75
are provided when the first and second level of fullness are reached.
The prefetch procedure is invoked at a lower processing priority when
the cache 40 reaches the first level of fullness. That is, when the
LT 74 is detected after the first level of fullness has been
exceeded, the prefetch procedure is started. The procedure
determines, for each entry 71 in the LRU list 70 in LRU order, if the
corresponding old data block and old parity block are stored in the
cache 40. If they are not, the prefetch routine reads the old data
block and the old parity block into the cache 40 and marks them as
prefetched.

In other words, the prefetch routine opportunistically uses any
available processor time, prior to new data being written, to execute
anticipated I/O requests during periods of relatively low processor
activity, reducing the I/O demand during peak write load, and
yielding lower write latency and better write throughput.

In FIG. 4, there is shown a flow chart of a prefetch procedure 100,
according to one embodiment of the invention, for prefetching data
blocks and parity blocks. In step 110, the computer determines if the
cache 40 has exceeded the first level of fullness. That is, is the
low threshold signal LT 74 present?

If the answer is no, that is, the LT 74 is not present, or the cache
40 has not exceeded a first level of fullness, the procedure 100 is
done. Otherwise, for each entry 71 of the LRU list 70, in LRU order,
beginning with step 120, process the steps 120-140 at a lower
processor priority.

In step 120, the computer determines if there is a next oldest entry
71 in the LRU list 70. If there is not, the procedure 100 is done,
otherwise perform step 130.

In step 130, the computer determines if the data and parity blocks
corresponding to the next entry 71 have been prefetched.

If the answer in step 130 is yes, that is, the data and parity blocks
have been prefetched, continue with step 120. Otherwise, in step 140,
read the data and parity blocks and mark them as prefetched. Then
proceed with step 120.

This procedure minimizes the impact of the extra I/O requests
required to generate parity data at the time that a block of data is
to be written to the disk drives. [...] At the second level of
fullness, the high threshold signal causes blocks 41 to be
deallocated from the cache 40 until sufficient blocks 41 have been
deallocated to bring the cache 40 below the first level of fullness.
Since all of the blocks necessary to generate parity data have been
prefetched, the number of I/O requests at the time of deallocation is
greatly reduced, thereby improving the system performance.

While there has been shown and described a preferred embodiment, it
is understood that various other adaptations and modifications may be
made within the spirit and scope of the invention.

What is claimed is:

1. An apparatus for prefetching data from a plurality of disk drives,
each of the plurality of disk drives organized into a plurality of
disk blocks, the plurality of disk blocks further organized into a
plurality of stripes, each of the plurality of stripes including at
least one disk block from each of the plurality of disk drives, each
of the plurality of stripes further including a parity disk block
storing parity data generated from the data stored in the other disk
blocks of the stripe, the apparatus comprising:

  memory means for storing data, said memory means partitioned into
  memory blocks compatible with the disk block organization of the
  plurality of disk drives;

  means for maintaining said memory blocks in a least recently used
  (LRU) order;

  means for generating a low threshold signal in response to a first
  predetermined number of said memory blocks being filled with data;
  and

  means, responsive to said low threshold signal, for starting a
  prefetch procedure to read disk blocks associated with a
  predetermined stripe, said predetermined stripe including at least
  one disk block to be overwritten by a LRU memory block.

2. The apparatus as in claim 1 wherein said disk blocks associated
with said predetermined stripe include the parity disk block of said
predetermined stripe and said at least one disk block to be
overwritten by said LRU memory block.

3. The apparatus as in claim 1 further including means for marking
said LRU memory block as being prefetched.

4. The apparatus as in claim 1 further including means for generating
a high threshold signal in response to a second predetermined number
of said memory blocks being filled with data, said second number
being greater than said first number; and

  means for generating parity data from said LRU memory block, said
  at least one disk block to be overwritten, and the parity disk
  block; and

  means for storing said parity data in a parity memory block; and

  means for writing said LRU memory block and said parity memory
  block to said plurality of disk drives.

5. The apparatus as in claim 1 further including means for reading
disk blocks associated with a second predetermined stripe, said
second predetermined stripe including at least one disk block to be
overwritten by a next LRU memory block.

6. A method for prefetching data from a plurality of disk drives,
each of the plurality of disk drives organized into a plurality of
disk blocks, the plurality of disk blocks further organized into a
plurality of stripes, each of the plurality of stripes including at
least one disk block from each of the plurality of disk drives, each
of the plurality of stripes further including a parity disk block
storing parity data generated from the data stored in the other disk
blocks of the stripe, the method comprising the steps of:

  storing data in a memory means, said memory means partitioned into
  memory blocks compatible with the disk block organization of the
  plurality of disk drives;

  maintaining said memory blocks in a least recently used (LRU)
  order;

  generating a low threshold signal in response to a first
  predetermined number of said memory blocks being filled with data;
  and

  starting, in response to said low threshold signal, a prefetch
  procedure to read disk blocks associated with a predetermined
  stripe, said predetermined stripe including at least one disk block
  to be overwritten by a LRU memory block.

7. The method as in claim 6 further including the steps of reading
the parity disk block of said predetermined stripe, and reading said
at least one disk block to be overwritten by said LRU memory block.

8. The method as in claim 6 further including the step of marking
said LRU memory block as being prefetched.

9. The method as in claim 6 further including the step of generating
a high threshold signal in response to a second predetermined number
of said memory blocks being filled with data, said second number
being greater than said first number; and

  generating parity data from said LRU memory block, said at least
  one disk block to be overwritten, and the parity disk block; and

  storing said parity data in a parity memory block; and

  writing said LRU memory block and said parity memory block to said
  plurality of disk drives.

10. The apparatus as in claim 6 further including reading disk blocks
associated with a second predetermined stripe, said second
predetermined stripe including at least one disk block to be
overwritten by a next LRU memory block.

* * * * *
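
The parity generation recited in the claims, in which new parity is produced from the new memory block, the disk block to be overwritten, and the old parity disk block, can be illustrated with a short worked example. This is only a sketch under stated assumptions: the function names `xor_blocks` and `update_parity` are hypothetical, and the four-byte blocks stand in for real disk blocks on a five-drive array (four data blocks plus one parity block).

```python
def xor_blocks(*blocks):
    """Return the bytewise XOR of equally sized blocks (hypothetical helper)."""
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same size")
    result = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def update_parity(old_data, old_parity, new_data):
    """New parity from the old data, the old parity, and the new data."""
    return xor_blocks(old_data, old_parity, new_data)


# One stripe: four data blocks; the parity block is their XOR.
data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]),
        bytes([9, 10, 11, 12]), bytes([13, 14, 15, 16])]
parity = xor_blocks(*data)

# Overwrite the second data block using only the old data and old parity.
new_block = bytes([0xAA, 0xBB, 0xCC, 0xDD])
new_parity = update_parity(data[1], parity, new_block)
data[1] = new_block

# The incrementally updated parity matches a full recomputation.
assert new_parity == xor_blocks(*data)

# A lost data block is recoverable from the survivors and the parity.
rebuilt = xor_blocks(data[0], data[2], data[3], new_parity)
assert rebuilt == new_block
```

Because the update needs only the old data block and the old parity block, these are exactly the two blocks the prefetch procedure brings into the cache ahead of the write.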